Kernel Bayesian nonlinear matrix factorization based on variational inference for human–virus protein–protein interaction prediction

Identification of potential human–virus protein–protein interactions (PPIs) contributes to the understanding of the mechanisms of viral infection and to the development of antiviral drugs. Existing computational models often have many hyperparameters that must be tuned manually, which limits their computational efficiency and generalization ability. To address this, this study proposes a kernel Bayesian logistic matrix factorization model with automatic rank determination, VKBNMF, for the prediction of human–virus PPIs. VKBNMF introduces auxiliary information into logistic matrix factorization and places priors on the latent variables to build a Bayesian framework for automatic parameter search. In addition, we construct a variational inference framework for VKBNMF so that the model can be solved efficiently. The experimental results show that, in the pairwise PPI scenario, VKBNMF achieves average AUPRs of 0.9101, 0.9316, 0.8727, and 0.9517 on the four benchmark datasets, respectively, and in the scenarios of new human (viral) proteins, VKBNMF still achieves a higher hit rate. The case study further demonstrates that VKBNMF can be used as an effective tool for the prediction of human–virus PPIs.

In addition, experimental methods are not only time-consuming and laborious, but also struggle to yield a complete protein interactome 8. As the number of known virus-host PPIs continues to increase, computational models for the prediction of interspecies PPIs have received increasing attention 9. Yang et al. 8 used doc2vec to represent protein sequences as rich low-dimensional feature vectors and random forests to perform predictions; the results showed that this method outperformed SVM, AdaBoost and the multilayer perceptron. Yang et al. 10 combined evolutionary sequence features with a Siamese convolutional neural network architecture and a multilayer perceptron, introduced two transfer learning methods (namely the "frozen" type and the "fine-tuned" type), and successfully applied them to the prediction of virus-human PPIs by retraining the CNN layers. To predict potential human-virus PPIs, Tsukiyama et al. 11 used word2vec to obtain low-dimensional features from amino acid sequences and developed an LSTM-based prediction model. These supervised learning methods effectively use the sequence information of proteins and have achieved some success in the prediction of virus-human PPIs. However, most of them require negative sampling to generate training sets, which inevitably introduces false negative samples into the training set. In addition, these models often need a balanced ratio of positive and negative samples during training and do not make full use of the large number of remaining unknown interactions, which also limits their predictive ability to a certain extent.
In recent years, a growing number of network models for predicting interactions have been proposed. Based on multiple similarity kernels for viral (or human) proteins, Nourani et al. 12 proposed an adaptive multikernel preservation embedding (AMKPE) approach to perform predictions. The results show that AMKPE achieves better performance than some supervised learning methods. In a previous study, we proposed a sequence ensemble-based virus-human PPI prediction method (Seq-BEL) 13, which integrated sequence feature information and network structure into an ensemble learning model to improve prediction ability and stability. Recently, for the prediction of human-virus PPIs under various disease types, we proposed a logistic tensor decomposition model with sparse subspace learning 14, which introduced logistic functions and feature information into CP decomposition to improve the prediction of human-virus-disease triples. In addition, other binary interaction prediction methods also provide a reference for the prediction of virus-human PPIs. Peska et al. 15 proposed a Bayesian ranking model for predicting drug-target interactions based on Bayesian personalized ranking matrix factorization, which showed good predictive performance on multiple benchmark datasets. Sharma et al. 16 proposed a bagging-based ensemble framework for drug-target interaction prediction, which employs dimensionality reduction and active learning to deal with class-imbalanced data and showed excellent performance compared with five competing methods. Ding et al. 17 proposed a dual Laplacian regularized least squares (DLapRLS) model for drug-target interaction prediction, which used Hilbert-Schmidt Independence Criterion-based Multiple Kernel Learning (HSIC-MKL) to linearly integrate the corresponding kernels in drug space and target space, respectively, and built the drug-target interaction prediction model with DLapRLS. Yu et al. 18 proposed an end-to-end graph deep learning approach (LAGCN) that used GCN to capture structural information from heterogeneous networks of drugs and diseases, and introduced attention mechanisms to combine embeddings from different convolutional layers for drug-disease association prediction. Zhao et al. 19 proposed an improved graph representation learning method (iGRLDTI), which addresses the over-smoothing problem of graph neural networks (GNNs) by better capturing the more discriminative features of drugs and targets in the latent feature space. These models make full use of the network structure of biological entities and improve predictive ability. However, most of them contain many hyperparameters, and the parameter tuning required before the experiment affects the prediction efficiency and generalization ability of the model to a certain extent.
Therefore, this study proposes a kernel Bayesian nonlinear matrix factorization based on variational inference, VKBNMF, for human-virus PPI prediction. First, to reduce the sparsity of the interaction network and improve the accuracy of the similarity network, we extract the kernel neighborhood similarity from the completed virus-human PPI network and fuse it with the sequence similarity of the viral (or human) proteins to obtain a more accurate network structure. Second, to improve the learning ability of the model, we introduce the weighted logistic function into kernel Bayesian matrix factorization and adaptively determine the rank of the low-dimensional features by combining the sparsity-inducing priors of multiple latent variables. Finally, because integrating out the latent variables is difficult, we establish a variational inference framework to solve the model efficiently. Results on three experimental scenarios across four real datasets demonstrate the effectiveness of VKBNMF in predicting potential human-virus PPIs. Furthermore, the case study demonstrates that VKBNMF can be used as an effective tool for human-virus PPI prediction.

Method review
To explore potential virus-human PPIs, we propose a new method named VKBNMF, which consists of three main steps (as shown in Fig. 1). First, a variety of similarity networks are constructed from protein sequences and the training human-virus PPI network, and are fused to obtain a more accurate similarity of viral (or human) proteins (step 1 of Fig. 1). Second, the Bayesian framework of logistic matrix factorization is established, the auxiliary information of human (or viral) proteins and the prior probabilities of the latent variables are introduced, and the probabilistic graphical model of VKBNMF is constructed (step 2 of Fig. 1). Finally, variational inference is used to solve VKBNMF and predict potential human-virus PPIs (step 3 of Fig. 1).

Network construction
Let $Y \in \mathbb{R}^{M \times N}$ represent the interaction matrix of $M$ human proteins and $N$ viral proteins. When there is an interaction between the $i$th human protein and the $j$th viral protein, $Y_{ij} = 1$; otherwise $Y_{ij} = 0$. $S^h_{seq}$ (or $S^v_{seq}$) represents the sequence similarity of human (or viral) proteins. The task at hand is to predict potential interactions in $Y$.
According to previous research, reasonably extracting information from known interaction networks can enhance the accuracy of the network, thereby improving the predictive ability of the model 13,20-22. However, existing interaction networks are very sparse, and the information they contain is concentrated on well-studied samples, so extracting information directly from them introduces considerable noise. Therefore, drawing on the method of Xiao et al. 23, and based on $S^h_{seq}$ and $S^v_{seq}$, we use weighted k nearest neighbor profiles (WKNNP) to preliminarily complete the training matrix $Y$ and obtain $\hat{Y}$. In previous studies, we proposed a network construction method based on kernel neighborhood similarity (KSNS) 24,25, which can hierarchically integrate neighborhood and non-neighborhood information and mine nonlinear relationships among samples, and which has been applied successfully to several biological relationship prediction problems 20,21,26,27. KSNS calculates the similarity according to (1), in which the kernel transformation is chosen as the Gaussian function in this paper, $\|\cdot\|_F$ denotes the Frobenius norm, and $\odot$ is element-wise multiplication. $\mu_1$ and $\mu_2$ are regularization parameters; following previous studies 21,27,28, $\mu_1 = 4$ and $\mu_2 = 1$. According to (1), when $X = \hat{Y}$, the interaction profile similarity $S^h_{int}$ of human proteins is obtained; when $X = \hat{Y}^T$, the interaction profile similarity $S^v_{int}$ of viral proteins is obtained.
We thus obtain two similarities of human proteins ($S^h_{seq}$, $S^h_{int}$) and two similarities of viral proteins ($S^v_{seq}$, $S^v_{int}$), which measure the relationships among human (or viral) proteins from different aspects. To obtain a more accurate network structure, clusDCA 29 is used to fuse $S^h_{seq}$ and $S^h_{int}$ into the final human protein similarity $S^h$, and $S^v_{seq}$ and $S^v_{int}$ into the final viral protein similarity $S^v$.

VKBNMF
Liu et al. 30 introduced neighborhood similarity into logistic matrix factorization (LMF) and obtained the neighborhood regularized logistic matrix factorization model (NRLMF). NRLMF and its variants have been applied successfully to interaction prediction for various biological entities 28,30,31. However, NRLMF requires tedious hyperparameter tuning before performing prediction tasks, which not only affects computational efficiency but may also lead to overfitting. This paper establishes a Bayesian framework based on LMF, treats hyperparameters as latent variables, and introduces prior probabilities, so that the model can adaptively search for the optimal solution, avoid tedious hyperparameter tuning, and improve prediction performance and generalization ability.
Let $G \in \mathbb{R}^{M \times R}$ and $H \in \mathbb{R}^{N \times R}$ represent the factor matrices of human proteins and viral proteins, respectively. The interaction between the $m$th human protein and the $n$th viral protein then follows a Bernoulli distribution, whose density function can be expressed as

$$P(Y_{m,n} \mid G_{m\cdot}, H_{n\cdot}) = \sigma(G_{m\cdot} H_{n\cdot}^{T})^{Y_{m,n}} \left[1 - \sigma(G_{m\cdot} H_{n\cdot}^{T})\right]^{1 - Y_{m,n}}, \tag{2}$$

where $\sigma(\cdot)$ represents the sigmoid function, and $G_{m\cdot}$ and $H_{n\cdot}$ represent the $m$th row of $G$ and the $n$th row of $H$, respectively. NRLMF considers that known interactions are more important and need to be assigned higher weights. Assuming that all training samples are independent, the weighted conditional probability density of $Y$ can be expressed as

$$P(Y \mid G, H) = \prod_{m=1}^{M} \prod_{n=1}^{N} \sigma(G_{m\cdot} H_{n\cdot}^{T})^{c Y_{m,n}} \left[1 - \sigma(G_{m\cdot} H_{n\cdot}^{T})\right]^{1 - Y_{m,n}}, \tag{3}$$

where $c \ge 1$ represents the importance level. Figure 2 shows the probabilistic graphical model of VKBNMF with latent variables and corresponding priors.
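As a minimal sketch of the weighted likelihood described above (logistic link, known interactions weighted by the importance level $c$), the log of the weighted density can be evaluated as follows. The function names `log_sigmoid` and `weighted_log_likelihood` and the nested-list data layout are illustrative assumptions, not the authors' implementation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_log_likelihood(Y, G, H, c=16.0):
    """Log of the weighted Bernoulli likelihood.

    Y: M x N nested list of 0/1 labels.
    G: M x R and H: N x R nested lists of latent factors (hypothetical layout).
    c: importance level; each known interaction (Y = 1) is counted c times.
    """
    total = 0.0
    for m, g_row in enumerate(G):
        for n, h_row in enumerate(H):
            logit = sum(g * h for g, h in zip(g_row, h_row))
            # log(1 - sigmoid(x)) equals log(sigmoid(-x))
            total += c * Y[m][n] * log_sigmoid(logit)
            total += (1 - Y[m][n]) * log_sigmoid(-logit)
    return total
```

With $c = 1$ this reduces to the ordinary Bernoulli log-likelihood; larger $c$ increases the penalty for assigning a low probability to a known interaction.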
From Fig. 2, the probability of $Y$ is computed from the factor matrices $G$ and $H$ by (3). The probability distributions of $G$ and $H$ are obtained from $U \in \mathbb{R}^{M \times R}$ and $V \in \mathbb{R}^{N \times R}$ by integrating two types of auxiliary information, $K_u$ (e.g. $S^h$) and $K_v$ (e.g. $S^v$). $\sigma_g$, $\sigma_h$ and $\lambda$ are precision parameters, while $\alpha$ and $\beta$ are hyperparameters. In this section, we specify priors on all latent variables and parameters.
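In general, the effective dimension $R$ of the latent space (i.e. the effective column dimension of $U$ and $V$) is a tuning parameter whose selection is challenging and computationally expensive. To infer the value of $R$ while avoiding overfitting, we introduce automatic rank determination into the prior distributions of $U$ and $V$ 32. Specifically, the columns of $U$ and $V$ are assumed to be independent, and their $r$th columns follow zero-mean Gaussian priors with precision matrices $\lambda_r I_M$ and $\lambda_r I_N$, respectively:

$$P(U \mid \lambda) = \prod_{r=1}^{R} \mathcal{N}\left(U_{\cdot r} \mid 0, \lambda_r^{-1} I_M\right), \tag{4}$$

$$P(V \mid \lambda) = \prod_{r=1}^{R} \mathcal{N}\left(V_{\cdot r} \mid 0, \lambda_r^{-1} I_N\right), \tag{5}$$

where $I_M \in \mathbb{R}^{M \times M}$ and $I_N \in \mathbb{R}^{N \times N}$ are identity matrices, and $U_{\cdot r}$ and $V_{\cdot r}$ represent the $r$th columns of $U$ and $V$, respectively. $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_R] \in \mathbb{R}^{1 \times R}$ is the precision vector, and $\lambda_r$ controls the $r$th columns of $U$ and $V$. When $\lambda_r$ is large, $U_{\cdot r}$ and $V_{\cdot r}$ both approach 0, indicating that they contribute little to $Y$ and can be removed from $U$ and $V$. This process realizes the automatic determination of $R$. For the precision vector $\lambda$, the conjugate Gamma hyperprior is defined as

$$P(\lambda) = \prod_{r=1}^{R} \mathrm{Gamma}(\lambda_r \mid \alpha, \beta), \tag{6}$$

where $\mathrm{Gamma}(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$ is the Gamma distribution with parameters $\{\alpha, \beta\}$. In this study, a non-informative prior is selected 33, that is, $\alpha = 1$, $\beta = 1$. To integrate the auxiliary information effectively, the elements of the factor matrix $G$ are assumed to be independent, and the $(m, r)$th element $G_{m,r}$ follows a Gaussian distribution with expectation $(K_u)_{m\cdot} U_{\cdot r}$ and precision $\sigma_g$:

$$P(G \mid U, \sigma_g) = \prod_{m=1}^{M} \prod_{r=1}^{R} \mathcal{N}\left(G_{m,r} \mid (K_u)_{m\cdot} U_{\cdot r}, \sigma_g^{-1}\right). \tag{7}$$

Similarly, according to $K_v$ and $V$, the prior probability of $H$ is

$$P(H \mid V, \sigma_h) = \prod_{n=1}^{N} \prod_{r=1}^{R} \mathcal{N}\left(H_{n,r} \mid (K_v)_{n\cdot} V_{\cdot r}, \sigma_h^{-1}\right), \tag{8}$$

where $\sigma_h$ is the precision parameter. Here, $\sigma_g$ and $\sigma_h$ follow the Jeffreys prior:

$$P(\sigma_g) \propto \sigma_g^{-1}, \tag{9}$$

$$P(\sigma_h) \propto \sigma_h^{-1}. \tag{10}$$

According to the probabilistic graphical model described in Fig. 1, combined with the likelihood function in (3), the priors of $U$ and $V$ in (4) and (5), the prior of the precision vector $\lambda$ in (6), the priors of the factor matrices $G$ and $H$ in (7) and (8), and the priors of the precisions $\sigma_g$ and $\sigma_h$ in (9) and (10), the joint distribution of VKBNMF is given by

$$P(Y, \Theta) = P(Y \mid G, H)\, P(G \mid U, \sigma_g)\, P(H \mid V, \sigma_h)\, P(U \mid \lambda)\, P(V \mid \lambda)\, P(\lambda)\, P(\sigma_g)\, P(\sigma_h). \tag{11}$$

Let $\Theta = \{G, H, U, V, \lambda, \sigma_g, \sigma_h\}$ represent the set of all latent variables. Our goal is to compute the full posterior distribution of all latent variables given $Y$:

$$P(\Theta \mid Y) = \frac{P(Y, \Theta)}{\int P(Y, \Theta)\, d\Theta}. \tag{12}$$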

Model Inference of VKBNMF
The exact solution of (12) requires integrating over all latent variables, which is computationally intractable. Therefore, this study employs variational inference to obtain an approximate posterior distribution $q(\Theta)$ for $P(\Theta \mid Y)$. The principle of variational inference is to define a family of parametric distributions over the latent variables and update the parameters to minimize the Kullback-Leibler (KL) divergence between $q(\Theta)$ and $P(\Theta \mid Y)$ 34:

$$\ln P(Y) = L(q) + \mathrm{KL}\left(q(\Theta) \,\|\, P(\Theta \mid Y)\right), \tag{13}$$

where $\ln P(Y)$ represents the model evidence and its lower bound is defined as $L(q) = \int q(\Theta) \ln \frac{P(\Theta, Y)}{q(\Theta)}\, d\Theta$. According to the mean-field approximation, $q(\Theta)$ can be decomposed into

$$q(\Theta) = \prod_{k} q(\Theta_k). \tag{14}$$

When the other variables are fixed, the optimal posterior estimate of $q(\Theta_k)$ is defined as

$$\ln q(\Theta_k) = E_{q(\Theta_{\setminus k})}\left[\ln P(Y, \Theta)\right] + \text{const}, \tag{15}$$

where $E[\cdot]$ represents expectation, const represents a constant that does not depend on the current variable, and $\Theta_{\setminus k}$ represents the set $\Theta$ with $\Theta_k$ removed. All variables are updated sequentially while the other variables are kept fixed.
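1) Estimate the latent variable $\lambda$: Combining the respective priors of $U$, $V$ and $\lambda$ in (4), (5) and (6), the posterior approximation $\ln q(\lambda_r)$ is derived from (15) as

$$\ln q(\lambda_r) = \left(\alpha + \frac{M + N}{2} - 1\right) \ln \lambda_r - \left(\beta + \frac{1}{2} E\left[U_{\cdot r}^{T} U_{\cdot r}\right] + \frac{1}{2} E\left[V_{\cdot r}^{T} V_{\cdot r}\right]\right) \lambda_r + \text{const}. \tag{16}$$

From (16), the posterior density of $\lambda_r$ is a Gamma distribution,

$$q(\lambda_r) = \mathrm{Gamma}(\lambda_r \mid \alpha_r, \beta_r), \tag{17}$$

where the posterior parameters $\alpha_r$ and $\beta_r$ are

$$\alpha_r = \alpha + \frac{M + N}{2}, \qquad \beta_r = \beta + \frac{1}{2}\left(E\left[U_{\cdot r}^{T} U_{\cdot r}\right] + E\left[V_{\cdot r}^{T} V_{\cdot r}\right]\right). \tag{18}$$

The required expectations are

$$E\left[U_{\cdot r}^{T} U_{\cdot r}\right] = \bar{U}_{\cdot r}^{T} \bar{U}_{\cdot r} + \mathrm{tr}\left(\Sigma(U_{\cdot r})\right), \qquad E\left[V_{\cdot r}^{T} V_{\cdot r}\right] = \bar{V}_{\cdot r}^{T} \bar{V}_{\cdot r} + \mathrm{tr}\left(\Sigma(V_{\cdot r})\right), \tag{19}$$

where $\bar{U}_{\cdot r}$ and $\bar{V}_{\cdot r}$ represent the posterior expectations of $U_{\cdot r}$ and $V_{\cdot r}$, $\Sigma(U_{\cdot r})$ and $\Sigma(V_{\cdot r})$ represent their posterior covariance matrices, and $\mathrm{tr}(\cdot)$ represents the trace of a matrix.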
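The Gamma update for the precision vector can be sketched in a few lines. This is an illustrative sketch that assumes the standard conjugate result for the priors described above (shape $\alpha + (M+N)/2$ and rate $\beta + \frac{1}{2}(E[U_{\cdot r}^T U_{\cdot r}] + E[V_{\cdot r}^T V_{\cdot r}])$); the function name `update_lambda` and the argument layout are hypothetical.

```python
def update_lambda(U_mean, V_mean, U_cov_trace, V_cov_trace, alpha=1.0, beta=1.0):
    """Return a list of (alpha_r, beta_r, E[lambda_r]) for each latent column r.

    U_mean: M x R nested list of posterior means of U; V_mean: N x R for V.
    U_cov_trace[r], V_cov_trace[r]: traces of the posterior covariances
    of the r-th columns of U and V.
    """
    M, N, R = len(U_mean), len(V_mean), len(U_mean[0])
    posteriors = []
    for r in range(R):
        # E[u^T u] = mean^T mean + tr(covariance)
        e_uu = sum(row[r] ** 2 for row in U_mean) + U_cov_trace[r]
        e_vv = sum(row[r] ** 2 for row in V_mean) + V_cov_trace[r]
        alpha_r = alpha + 0.5 * (M + N)
        beta_r = beta + 0.5 * (e_uu + e_vv)
        posteriors.append((alpha_r, beta_r, alpha_r / beta_r))
    return posteriors
```

A column whose entries have shrunk towards zero receives a small rate and therefore a large expected precision, which is the mechanism that lets uninformative columns be pruned.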
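2) Estimate latent variables $U$ and $V$: Substituting the priors of the latent variables $U$ and $G$ into (15), the posterior approximation $\ln q(U_{\cdot r})$ is obtained as follows (see section 1 of Appendix for details):

$$\ln q(U_{\cdot r}) = -\frac{1}{2} U_{\cdot r}^{T}\left(\bar{\lambda}_r I_M + \bar{\sigma}_g K_u^{T} K_u\right) U_{\cdot r} + \bar{\sigma}_g U_{\cdot r}^{T} K_u^{T} \bar{G}_{\cdot r} + \text{const}, \tag{20}$$

where $I_M \in \mathbb{R}^{M \times M}$ is the identity matrix, $G_{\cdot r}$ represents the $r$th column of $G$, and a bar denotes a posterior expectation. From (20), $U_{\cdot r}$ follows a multivariate Gaussian distribution

$$q(U_{\cdot r}) = \mathcal{N}\left(U_{\cdot r} \mid \bar{U}_{\cdot r}, \Sigma(U_{\cdot r})\right). \tag{21}$$

The posterior expectation $\bar{U}_{\cdot r}$ and the covariance matrix $\Sigma(U_{\cdot r})$ are

$$\Sigma(U_{\cdot r}) = \left(\bar{\lambda}_r I_M + \bar{\sigma}_g K_u^{T} K_u\right)^{-1}, \qquad \bar{U}_{\cdot r} = \bar{\sigma}_g\, \Sigma(U_{\cdot r})\, K_u^{T} \bar{G}_{\cdot r}. \tag{22}$$

Similarly, the posterior of $V_{\cdot r}$ follows a multivariate Gaussian distribution

$$q(V_{\cdot r}) = \mathcal{N}\left(V_{\cdot r} \mid \bar{V}_{\cdot r}, \Sigma(V_{\cdot r})\right). \tag{23}$$

Its expectation and covariance matrix are

$$\Sigma(V_{\cdot r}) = \left(\bar{\lambda}_r I_N + \bar{\sigma}_h K_v^{T} K_v\right)^{-1}, \qquad \bar{V}_{\cdot r} = \bar{\sigma}_h\, \Sigma(V_{\cdot r})\, K_v^{T} \bar{H}_{\cdot r}. \tag{24}$$

3) Estimate latent variables $G$ and $H$: The likelihood function in (3) contains the exponential form of $G_{m\cdot}$, so no conjugate prior exists. Therefore, following 35, we use the approximation

$$\sigma(x) \ge \sigma(\xi) \exp\left\{\frac{x - \xi}{2} - \lambda(\xi)\left(x^2 - \xi^2\right)\right\}, \qquad \lambda(\xi) = \frac{1}{2\xi}\left(\sigma(\xi) - \frac{1}{2}\right). \tag{25}$$

Then, the weighted log likelihood of $Y_{m,n}$ satisfies

$$\ln P(Y_{m,n} \mid G_{m\cdot}, H_{n\cdot}) \ge h(\xi_{m,n}, G_{m\cdot}, H_{n\cdot}) = a_{m,n}\, G_{m\cdot} H_{n\cdot}^{T} - b_{m,n} \left(G_{m\cdot} H_{n\cdot}^{T}\right)^2 + \left(c\, y_{mn} + 1 - y_{mn}\right)\left[\ln \sigma(\xi_{m,n}) - \frac{\xi_{m,n}}{2} + \lambda(\xi_{m,n})\, \xi_{m,n}^2\right], \tag{26}$$

where $\xi_{m,n}$ represents the local variational parameter, $a_{m,n} = \frac{c\, y_{mn} - 1 + y_{mn}}{2}$ and $b_{m,n} = \left(c\, y_{mn} + 1 - y_{mn}\right) \lambda(\xi_{m,n})$. It can be seen that $h(\xi_{m,n}, G_{m\cdot}, H_{n\cdot})$ is a quadratic function of $G_{m\cdot}$ and is a lower bound of the log likelihood. By replacing $P(Y_{m,n} \mid G_{m\cdot}, H_{n\cdot})$ with $h(\xi_{m,n}, G_{m\cdot}, H_{n\cdot})$ and combining (7) and (15), it can be found that the posterior of $G_{m\cdot}$ is the multivariate Gaussian distribution $q(G_{m\cdot}) = \mathcal{N}(G_{m\cdot} \mid \bar{G}_{m\cdot}, \Sigma(G_{m\cdot}))$, whose expectation and covariance matrix are given by (27) (see section 2 of Appendix for details), where $\bar{H}$ represents the expectation of $H$. Similarly, the posterior of $H_{n\cdot}$ is the multivariate Gaussian distribution $q(H_{n\cdot}) = \mathcal{N}(H_{n\cdot} \mid \bar{H}_{n\cdot}, \Sigma(H_{n\cdot}))$, whose expectation and covariance matrix are given by (28), where $\bar{G}$ represents the expectation of $G$.

4) Estimate latent variables $\sigma_g$ and $\sigma_h$: Substituting (7) and (9) into (15), the approximate posterior $\ln q(\sigma_g)$ is as follows:

$$\ln q(\sigma_g) = \left(\frac{MR}{2} - 1\right) \ln \sigma_g - \frac{\sigma_g}{2}\, E\left[\|G - K_u U\|_F^2\right] + \text{const}. \tag{29}$$

Therefore, the posterior distribution of $\sigma_g$ is a Gamma distribution with expectation

$$\bar{\sigma}_g = \frac{a_g}{b_g}, \qquad a_g = \frac{MR}{2}, \qquad b_g = \frac{1}{2} E\left[\|G - K_u U\|_F^2\right], \tag{30}$$

where $a_g$ and $b_g$ are the posterior parameters of $\sigma_g$ and, referring to Theorem 1 in the appendix, $E\left[\|G - K_u U\|_F^2\right]$ is given by (31). Similarly, the posterior distribution of $\sigma_h$ is a Gamma distribution with expectation

$$\bar{\sigma}_h = \frac{a_h}{b_h}, \qquad a_h = \frac{NR}{2}, \qquad b_h = \frac{1}{2} E\left[\|H - K_v V\|_F^2\right], \tag{32}$$

where $a_h$ and $b_h$ are the posterior parameters of $\sigma_h$, and $E\left[\|H - K_v V\|_F^2\right]$ is obtained similarly to formula (31).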
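As a worked step for the Gaussian update of $G_{m\cdot}$ in step 3, completing the square in a bound that is linear and quadratic in $G_{m\cdot} H_{n\cdot}^{T}$ (with coefficients $a_{m,n}$ and $b_{m,n}$) together with the Gaussian prior on $G$ gives the expressions below. This is a sketch derived here under the mean-field assumption, treating rows as row vectors and writing $(K_u)_{m\cdot}$ for the $m$th row of $K_u$ and a bar for a posterior mean; the appendix of the paper may arrange or approximate these terms differently.

```latex
\begin{aligned}
\Sigma(G_{m\cdot}) &= \Big(\bar{\sigma}_g I_R
    + 2\sum_{n=1}^{N} b_{m,n}\, E\big[H_{n\cdot}^{T} H_{n\cdot}\big]\Big)^{-1},\\
\bar{G}_{m\cdot} &= \Big(\bar{\sigma}_g\,(K_u)_{m\cdot}\,\bar{U}
    + \sum_{n=1}^{N} a_{m,n}\,\bar{H}_{n\cdot}\Big)\,\Sigma(G_{m\cdot}),\\
E\big[H_{n\cdot}^{T} H_{n\cdot}\big] &= \bar{H}_{n\cdot}^{T}\bar{H}_{n\cdot} + \Sigma(H_{n\cdot}).
\end{aligned}
```

The update for $H_{n\cdot}$ has the same form with the roles of $G$ and $H$ exchanged, $K_u$, $U$, $\sigma_g$ replaced by $K_v$, $V$, $\sigma_h$, and the sums taken over $m$.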
5) Update the local variational parameter $\xi_{m,n}$: According to (26), differentiating $h(\xi_{m,n}, G_{m\cdot}, H_{n\cdot})$ with respect to $\xi_{m,n}$ and setting the derivative equal to 0 yields the optimal value of $\xi_{m,n}$ given in (33) (see section 4 of Appendix for details), where $\mathrm{vec}(\cdot)$ represents converting a matrix into a row vector.
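A small sketch of this step is given below. It assumes the well-known optimum of the Jaakkola-Jordan bound, $\xi^2 = E[(G_{m\cdot} H_{n\cdot}^{T})^2]$, evaluated for independent Gaussian factors; the exact vectorised form used in the paper (with $\mathrm{vec}(\cdot)$) may be written differently. The names `jj_lambda` and `update_xi` are hypothetical, and NumPy is assumed to be available.

```python
import math

import numpy as np


def jj_lambda(xi):
    """lambda(xi) = (sigmoid(xi) - 1/2) / (2 xi) = tanh(xi / 2) / (4 xi)."""
    if abs(xi) < 1e-8:
        return 0.125  # limit as xi -> 0
    return math.tanh(xi / 2.0) / (4.0 * xi)


def update_xi(g_mean, g_cov, h_mean, h_cov):
    """Return sqrt(E[(g h^T)^2]) for independent Gaussian row vectors g and h.

    Uses E[(g h^T)^2] = tr(E[g^T g] E[h^T h]) with
    E[g^T g] = mean^T mean + covariance (same for h).
    """
    e_gg = np.outer(g_mean, g_mean) + g_cov
    e_hh = np.outer(h_mean, h_mean) + h_cov
    return float(np.sqrt(np.trace(e_gg @ e_hh)))
```

When both covariances are zero the result is simply the absolute value of the inner product of the two mean vectors; posterior uncertainty in either factor increases $\xi_{m,n}$.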
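In summary, the optimization algorithm for solving VKBNMF is shown in Algorithm 1.
Algorithm 1. Variational inference for VKBNMF.
Repeat
Update the posterior of the precision vector using (17)-(19).
Update the posteriors of U and V using (21)-(24).
Update the posteriors of G and H using (27) and (28).
Update the posteriors of the precisions using (30) and (32).
Update the local variational parameter using (33).
Until convergence.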
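One step of this iterative procedure, the Gaussian update for a single column of $U$, can be sketched as follows. The sketch assumes a zero-mean Gaussian prior with precision $\lambda_r$ on the column and a Gaussian link from $K_u U$ to $G$ with precision $\sigma_g$, as described in the text; the function name and arguments are hypothetical, NumPy is assumed to be available, and a direct matrix inverse is used only for readability.

```python
import numpy as np


def update_U_column(K_u, G_mean_col, lambda_mean_r, sigma_g_mean):
    """Posterior mean and covariance of one column of U.

    K_u: (M, M) kernel matrix of human proteins.
    G_mean_col: (M,) posterior mean of the matching column of G.
    lambda_mean_r: posterior mean of the column precision.
    sigma_g_mean: posterior mean of the precision linking K_u U to G.
    """
    M = K_u.shape[0]
    precision = lambda_mean_r * np.eye(M) + sigma_g_mean * (K_u.T @ K_u)
    cov = np.linalg.inv(precision)
    mean = sigma_g_mean * (cov @ (K_u.T @ G_mean_col))
    return mean, cov
```

For large $M$, a Cholesky solve or an eigendecomposition of $K_u^T K_u$ computed once and reused across columns would be the more economical design, since only the scalar precisions change between iterations.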

Data extraction
The MorCVD database covers 19 microbial-induced cardiovascular diseases including endocarditis, myocarditis, and pericarditis, as well as 23,377 interactions between 3957 viral proteins of 432 viruses and 3202 human proteins 36. We took vascular disease as the keyword and downloaded the human-virus PPIs of each disease one by one from the database. To ensure that as many human (or viral) proteins as possible are covered in the dataset, we removed disease types that contain fewer than 100 human (or viral) proteins. Finally, the human-virus PPIs under four disease types (corresponding to the four benchmark datasets) were obtained, as shown in Table 1.
From Table 1, the known interactions contained in the four benchmark datasets are very sparse (accounting for less than 1%).To obtain additional auxiliary information, we extracted amino acid sequences of these proteins from the UniProt database 37 by R package "protr" 38 , and calculated the pseudo-amino acid composition 39 (abbreviated as PseAAC) feature of human (or viral) proteins according to the regularization frequency of amino acids.Further, according to the PseAAC feature, KSNS is used to construct the sequence similarity of human (or viral) proteins.In summary, the four benchmark datasets in this study contain human-virus PPIs under four disease types, as well as the sequence similarity S h seq (or S v seq ) of the corresponding human (or viral) proteins.

Experimental settings
To examine the prediction ability of the model for human-virus PPIs, new human proteins and new viral proteins, we performed fivefold cross-validation in three different scenarios, following previous studies 26-28,40.
Table 1. The statistics of the four datasets. "H_num" indicates the number of human proteins, "V_num" indicates the number of viral proteins, "I_num" indicates the number of interactions, and "Prop" indicates the proportion of known interactions. "CI" refers to "Cardiovascular Infections", "DC" refers to "Dilated Cardiomyopathy", "ED" refers to "Endocarditis" and "VM" refers to "Viral Myocarditis".
(1) "Pairwise interaction" scenario: Evaluate the predictive power with respect to human-virus PPIs. The known interactions of Y are randomly divided into five equal parts, four of which are used for training and the remaining one for testing.
(2) "Human protein" scenario: Evaluate the predictive power with respect to human proteins. The rows of Y are randomly divided into five equal parts, four of which are used for training and the remaining one for testing.
(3) "Viral protein" scenario: Evaluate the predictive power with respect to viral proteins. The columns of Y are randomly divided into five equal parts, four of which are used for training and the remaining one for testing.
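The three splitting schemes above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of a nested list for Y, and the choice to mask test cells by setting them to 0 in the training matrix are all assumptions.

```python
import random


def five_fold_splits(Y, scenario, n_folds=5, seed=0):
    """Yield (fold_index, test_cells), where test_cells is a list of (row, col).

    scenario: "pairwise" splits the known interactions,
              "human" splits whole rows, "viral" splits whole columns.
    """
    rng = random.Random(seed)
    M, N = len(Y), len(Y[0])
    if scenario == "pairwise":
        units = [(m, n) for m in range(M) for n in range(N) if Y[m][n] == 1]
    elif scenario == "human":
        units = list(range(M))
    elif scenario == "viral":
        units = list(range(N))
    else:
        raise ValueError("unknown scenario: " + scenario)
    rng.shuffle(units)
    for k in range(n_folds):
        fold = units[k::n_folds]  # every n_folds-th unit, near-equal sizes
        if scenario == "pairwise":
            test_cells = list(fold)
        elif scenario == "human":
            test_cells = [(m, n) for m in fold for n in range(N)]
        else:
            test_cells = [(m, n) for n in fold for m in range(M)]
        yield k, test_cells


def mask_training_matrix(Y, test_cells):
    """Copy Y and hide the test cells by setting them to 0."""
    Y_train = [row[:] for row in Y]
    for m, n in test_cells:
        Y_train[m][n] = 0
    return Y_train
```

In the "human" and "viral" scenarios an entire row or column is hidden, so the held-out protein has no known interactions during training and predictions must rely on its similarity to other proteins.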
For the "Pairwise interaction" scenario, following previous studies 40-43, we select the average AUPR, AUC and F1 values of fivefold cross-validation as evaluation indicators. For the "Human protein" and "Viral protein" scenarios, more attention is often paid to the top-ranked candidate interactions, namely the hit rate 12,13,40, which is calculated as follows:

$$\text{Hit rate}(\rho) = \frac{\left|S_{cand}([\rho \cdot N]) \cap S_{Test}\right|}{\left|S_{Test}\right|}, \tag{34}$$

where $N$ represents the number of elements contained in the test set and $\rho$ represents the scale factor, which takes values in {2%, 6%, 10%} in this study.
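A hit rate of this kind can be computed as in the sketch below, which assumes the usual definition: the fraction of true test interactions that appear among the top $[\rho \cdot N]$ scored candidates. The function name `hit_rate`, the dictionary layout of the scores, and the round-half-up convention for $[\cdot]$ are assumptions for illustration.

```python
def hit_rate(scores, test_positives, rho):
    """Fraction of true test interactions recovered in the top-ranked candidates.

    scores: dict mapping each candidate pair (m, n) in the test set to its score.
    test_positives: set of pairs that truly interact.
    rho: scale factor as a fraction, e.g. 0.02 for the top 2%.
    """
    if not test_positives:
        return 0.0
    n_top = int(rho * len(scores) + 0.5)  # round half up
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = set(ranked[:n_top])
    return len(top & test_positives) / len(test_positives)
```

For example, with ten candidates, `rho = 0.2` and two true interactions of which one is ranked first, the function returns 0.5.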

Hyperparameter analysis
The importance level parameter $c$ is the only important hyperparameter of VKBNMF. To analyse the effect of $c$ on prediction performance, we employ a grid search. Let $c$ take values in $\{2^0, 2^1, \ldots, 2^6\}$ and perform fivefold cross-validation on the four benchmark datasets in the "Pairwise interaction" scenario; the resulting AUPR values are shown in Fig. 3. From Fig. 3, $c$ has a significant effect on prediction performance on all four benchmark datasets. When $c = 1$, i.e., known and unknown interactions are considered equally important, the model has the lowest AUPR on all four datasets. As $c$ increases, the prediction performance gradually improves, and once $c$ reaches $2^4$ the AUPR stabilises on all datasets. Therefore, this study sets $c = 16$ in subsequent experiments. The above analysis also shows that introducing the importance level improves the prediction performance to some extent.

Comparison experiments
To comprehensively evaluate the prediction performance of VKBNMF, we select six state-of-the-art interaction prediction models. The four advanced network models are Kernel Bayesian Matrix Factorization (KBMF) 44, Hypergraph Logistic Matrix Factorization (HGLMF) 28, Generalized Matrix Factorization Based on Weighted Hypergraph Learning (WHGMF) 40, and Dual Laplacian Regularized Least Squares (DLapRLS) 17. The two state-of-the-art deep learning methods are Layer Attention Graph Convolutional Networks (LAGCN) 18 and Graph Attention Networks and Dual Laplacian Regularized Least Squares (MKGAT) 45. To ensure a fair comparison, we use the method described in the "Network construction" section to build the network for all models, and use the optimal parameters provided in the original code to perform prediction. Under the "Pairwise interaction" scenario, the prediction results of all models on the four benchmark datasets are shown in Table 2.
As shown in Fig. 4, for the "new human protein" and "new virus protein" scenarios, VKBNMF shows excellent performance on most datasets.
In summary, whether in the "Pairwise interaction", "Human protein", or "Viral protein" scenario, VKBNMF shows excellent predictive performance on most datasets. The main reasons are as follows. First, compared with the discriminative models, the generative models (VKBNMF and KBMF) treat parameters as latent variables and solve for them adaptively through variational inference, which not only avoids tedious parameter tuning but also provides considerable generalization ability and robustness. Second, compared with KBMF, VKBNMF improves prediction performance by introducing a nonlinear function and importance levels into the Bernoulli distribution. Finally, VKBNMF introduces automatic rank determination to adaptively learn the effective dimension $R$ of the latent space, and sets a non-informative prior on the precision parameters to avoid manual search and improve computational efficiency.

Robustness analysis
To assess the robustness of the models, we calculate the average AUPR, AUC and F1 values of all models under 20 different random seeds of fivefold cross-validation; the results are shown in Table 2. We also draw a boxplot in Fig. 5, showing statistics of the AUPR, AUC, and F1 values of VKBNMF across the 20 random seeds. The mean variances of the AUPR, AUC and F1 values over the four datasets are $1.644 \times 10^{-5}$, $1.727 \times 10^{-5}$ and $1.938 \times 10^{-5}$, respectively, which indicates that VKBNMF exhibits good robustness.
Furthermore, we perform paired Wilcoxon rank-sum tests of VKBNMF against the other predictive models in terms of AUC, AUPR, and F1 scores; the results are shown in Table 3. VKBNMF significantly outperforms the other prediction models at the 95% confidence level (p-value < 0.05) on all datasets, which again demonstrates the superiority of VKBNMF in the prediction of human-virus PPIs.

Case study
This section selects three common viruses, Epstein-Barr virus, Influenza A virus, and Human papillomavirus, as case studies to explore these viral diseases and the interactions between their viral proteins and human proteins. For each virus, we deleted all PPIs with human proteins under the corresponding disease and ran VKBNMF to obtain predicted interaction probabilities. Based on the prediction scores, we obtained the top 10 PPIs with the highest probability of interacting with the virus. The predicted PPIs were then checked against evidence from various databases of human-virus PPIs (e.g. MINT, VirHostNet, IntAct, and BioGRID). As a result, 8, 9, and 8 of the top 10 PPIs for Epstein-Barr virus, Influenza A virus, and Human papillomavirus were verified, respectively.
Epstein-Barr virus (EBV), also known as Human gammaherpesvirus 4, is a member of the herpesvirus family; it is a double-stranded DNA virus and one of the most common human viruses 46. EBV is found all over the world and is generally transmitted through body fluids, mainly saliva. This virus is closely related to malignancies such as gastric cancer and nasopharyngeal cancer 47, as well as Alice in Wonderland syndrome 48 and acute cerebellar ataxia 49 in children. Calderwood et al. 50 found that human proteins targeted by EBV proteins are enriched in highly connected or hub proteins, and that targeting hubs may be an efficient mechanism for EBV to reorganize cellular processes. In this study, all interactions between Epstein-Barr virus (Taxonomy ID 10376) and human proteins under Cardiovascular Infections were deleted, and 8 of the top 10 PPIs predicted by VKBNMF were verified, as shown in Table 4.
Influenza A subtype H5N1 is a subtype of influenza A virus that causes disease in humans and many other species 51. Handling infected poultry is a risk factor for H5N1 infection, and about 60 percent of humans known to be infected with the Asian strain of H5N1 have died from the virus. Furthermore, H5N1 may mutate or recombine into a strain capable of efficient human-to-human transmission 52. Due to its high lethality, endemic presence, and continuous major mutations, H5N1 was once considered the world's greatest pandemic threat, and countries around the world devoted substantial manpower and material resources to H5N1 research. In this study, all interactions between H5N1 (Taxonomy ID 284218) and human proteins under Viral Myocarditis were deleted, and 9 of the top 10 PPIs predicted by VKBNMF were verified, as shown in Table 5.
Human papillomavirus (HPV) infection is one of the most common sexually transmitted diseases and has been associated with cancers such as cervical cancer, head and neck squamous cell carcinoma (HNSCC), and anal cancer 53. HPV infection is mainly transmitted through skin-to-skin or skin-to-mucosal contact 54. HPV 16 is the most common high-risk type of HPV; it causes about 50% of cervical cancers worldwide and usually does not cause any noticeable symptoms, although it can bring about cervical changes 55. In this study, all interactions between HPV 16 (Taxonomy ID 333760) and human proteins under Viral Myocarditis were deleted, and 8 of the top 10 PPIs predicted by VKBNMF were verified, as shown in Table 6.

Discussion
This study proposes a novel human-virus PPIs prediction method named kernel Bayesian nonlinear matrix factorization based on variational inference (VKBNMF).The novelty of this method is to establish a Bayesian framework of nonlinear matrix factorization and introduce auxiliary information to improve the predictive ability of new proteins.Meanwhile, VKBNMF takes model parameters as latent variables, and realizes the adaptive solution of parameters by inferring its posterior probability, avoiding tedious parameter debugging and enhancing the generalization ability of the model.In addition, this study builds a variational framework for model solving, which ensures the efficiency of solving large-scale data.
To evaluate the performance of VKBNMF, we conducted extensive experiments on multiple benchmark datasets and various experimental scenarios. The experimental results show that, for the "Pairwise interaction" scenario, VKBNMF achieved better AUPR, AUC and F1 values on all datasets except CI. Under the "Human protein" scenario, the hit rates of VKBNMF are slightly lower than those of KBMF and HGLMF on the CI and DC datasets, respectively, while VKBNMF achieves significantly higher hit rates on the remaining two datasets. Under the "Viral protein" scenario, VKBNMF showed a higher hit rate on all four benchmarks. Finally, we take three common viruses as case studies to further verify the effectiveness of our method.
However, VKBNMF still has some aspects worthy of further study. First, to facilitate the solution of the model, we selected common conjugate priors, such as the multivariate Gaussian distribution and the Gamma distribution. Future work will explore other effective prior distributions. Second, for the purpose of model evaluation, we studied human-virus PPIs in different diseases separately, ignoring the relationships between diseases. In the future, we plan to establish an integrated prediction model combining disease types and human-virus PPIs.

Figure 1. The overall workflow of VKBNMF for predicting potential human-virus PPIs.

Figure 3. Effect of the importance level c on model prediction performance.
In the hit rate formula (34), $[\cdot]$ represents rounding, $S_{cand}([\rho \cdot N])$ represents the top $[\rho \cdot N]$ PPIs with the highest predicted scores, and $S_{Test}$ represents the actual PPIs in the test set.

Figure 4. Comparison of model prediction performance for the top 2% hit rate.

Figure 5. The values of AUC, AUPR, and F1 obtained by VKBNMF under 20 random seeds of fivefold cross-validation.

Table 2. Comparison of the prediction performance under the "Pairwise interaction" scenario. The numbers in bold represent the optimal values of the current indicator.

Table 3. The P-values of the paired Wilcoxon rank-sum test of VKBNMF against other predictive models.

Table 4. The top 10 PPIs of Epstein-Barr virus identified by VKBNMF.

Table 5. The top 10 PPIs of Influenza A virus identified by VKBNMF.
Specifically, for the "new human protein" scenario, the hit rate of VKBNMF on CI is 18% higher than the 0.3237 of KBMF (ranked second); the hit rate on DC is 0.3254, slightly lower than the 0.3421 of HGLMF; the hit rate on ED is 0.2698, an increase of 26.13% over the 0.2139 of WHGMF (ranked second); and the hit rate on VM is 0.5038, which is 9% higher than the 0.4622 of HGLMF (ranked second). For the "new virus protein" scenario, VKBNMF shows the best performance on all four benchmark datasets. In particular, on the DC dataset, the top 2% hit rate of VKBNMF exceeds 0.6. Supplemental Tables S1 and S2 show the top 2%, 6%, and 10% hit rates of all methods in the "new human protein" and "new viral protein" scenarios, respectively.

Table 6. The top 10 PPIs of Human papillomavirus identified by VKBNMF.